Abstract: When applying the support vector machine (SVM) to high-
dimensional classification problems, we often impose a sparse structure
in the SVM to eliminate the influences of the irrelevant predictors. The 
lasso and other variable selection techniques have been successfully used 
in the SVM to perform automatic variable selection. In some problems, 
there is a natural hierarchical structure among the variables. Thus, in or- 
der to have an interpretable SVM classifier, it is important to respect the 
heredity principle when enforcing the sparsity in the SVM. Many variable 
selection methods, however, do not respect the heredity principle. In this 
paper we enforce both sparsity and the heredity principle in the SVM by 
using the so-called structured variable selection (SVS) framework originally 
proposed in [20]. We minimize the empirical hinge loss under a set of linear
inequality constraints and a lasso-type penalty. The solution always obeys 
the desired heredity principle and enjoys sparsity. The new SVM classi- 
fier can be efficiently fitted, because the optimization problem is a linear 
program. Another contribution of this work is to present a nonparametric 
extension of the SVS framework, and we propose nonparametric heredity 
SVMs. Simulated and real data are used to illustrate the merits of the 
proposed method. 
1. Introduction

The support vector machine (SVM) is a widely used classification method. Let 
x denote a generic feature vector. The class labels, $y$, are coded as $\{1, -1\}$. For
a given training data set $\{x_i, y_i\}$, $i = 1, 2, \ldots, n$, the SVM can be expressed in
a penalized hinge loss formulation (cf. [11] and [15])

$$(\hat\beta, \hat\beta_0) = \arg\min_{\beta, \beta_0} \sum_{i=1}^n \left[1 - y_i(x_i^T\beta + \beta_0)\right]_+ + \lambda\|\beta\|_2^2, \tag{1.1}$$

where the subscript "+" means the positive part ($z_+ = \max(z, 0)$). The SVM
classifier is $\mathrm{Sign}(\hat\beta_0 + x^T\hat\beta)$. It is now well known that by imposing some structure
in the SVM, we could significantly enhance its classification performance and 
obtain a more interpretable model [11]. For example, when the dimension of the
predictors is high and there are many irrelevant predictors, imposing sparsity
in $\beta$ via an automatic variable selection procedure can significantly enhance
classification performance of the SVM. Various variable selection proposals have
been introduced in recent years to encourage sparsity in $\beta$ for the SVM. See [25]
and references therein. In particular, Bradley and Mangasarian [1] and Zhu et
al. [?] suggested replacing the quadratic penalty in (1.1) with the lasso (or $\ell_1$)
penalty:

$$(\hat\beta, \hat\beta_0) = \arg\min_{\beta, \beta_0} \sum_{i=1}^n \left[1 - y_i(x_i^T\beta + \beta_0)\right]_+ + \lambda\|\beta\|_1. \tag{1.2}$$

Similar to the lasso [?] for linear regression, the lasso penalty shrinks some
of the $\beta$ coefficients to exactly zero and therefore performs variable selection.
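
As a small illustration of the difference between (1.1) and (1.2), the following Python sketch evaluates the penalized hinge objective for a given coefficient vector. The function names and the `penalty` argument are hypothetical and introduced only for this example; the sketch evaluates the objective and does not fit an SVM.

```python
def hinge(z):
    """Positive part z_+ = max(z, 0)."""
    return max(z, 0.0)


def svm_objective(X, y, beta0, beta, lam, penalty="l1"):
    """Penalized hinge loss: (1.1) for penalty="l2", (1.2) for penalty="l1".

    X is a list of feature vectors, y a list of labels in {1, -1}.
    """
    loss = 0.0
    for xi, yi in zip(X, y):
        score = sum(xj * bj for xj, bj in zip(xi, beta)) + beta0
        loss += hinge(1.0 - yi * score)
    if penalty == "l1":
        pen = sum(abs(b) for b in beta)      # lasso penalty, encourages exact zeros
    else:
        pen = sum(b * b for b in beta)       # quadratic penalty of the standard SVM
    return loss + lam * pen
```

With `X = [[1.0, 0.0], [0.0, 1.0]]`, `y = [1, -1]`, `beta0 = 0.0`, `beta = [2.0, 0.0]` and `lam = 0.5`, the hinge loss is 1 in both cases, while the penalty term is 0.5 * 2 = 1 under the lasso penalty and 0.5 * 4 = 2 under the quadratic penalty.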

Despite their successes, these general-purpose variable selection methods do 
not take advantage of the possible interrelationship among features. Consider 
for example a quadratic classifier with explanatory variables $z_1, z_2, \ldots, z_q$:

$$\beta_1 z_1 + \cdots + \beta_q z_q + \beta_{11} z_1^2 + \beta_{12} z_1 z_2 + \cdots + \beta_{q,q-1} z_q z_{q-1} + \beta_{qq} z_q^2. \tag{1.3}$$

In employing the $\ell_1$ SVM to learn the $\beta$ coefficients, one may consider using
$x = [z_1, \ldots, z_q, z_1 z_2, \ldots, z_{q-1} z_q, z_1^2, \ldots, z_q^2]$ as the derived variables in (1.2). In
doing so, we neglect the difference between quadratic effects and linear effects.
In situations like this, it is desirable to invoke the effect heredity principle [?].
There are two popular versions of the effect heredity [?]. Under the strong heredity,
for a two-factor interaction effect $z_i z_j$ to be active both its parent effects, $z_i$
and $z_j$, should be active, whereas under the weak heredity only one of its parent
effects needs to be active. Likewise, one may also require that $z_j^2$ is allowed to
be active only if $z_j$ is active. In this paper we develop a new method that can
simultaneously impose the sparse structure and the heredity structure in the 
SVM model. 

Earlier interest in the heredity principle came from the analysis of designed
experiments, where the heredity principle had proven to be a powerful tool in
resolving complex aliasing patterns (cf. [?], [?] and [?]). The heredity principle was
routinely followed in general regression problems as well. Efron et al. [7] and
Turlach [?] discussed how to enforce the strong heredity principle in the efficient
LARS algorithm. Later, Yuan, Joseph and Lin [19] proposed more flexible ways of
incorporating the strong and weak heredity principles in linear regression. Zhao,
Rocha and Yu [23] presented the Composite Absolute Penalties, which can
produce a hierarchical model. Choi and Zhu [?] proposed a penalization method for
enforcing the strong heredity principle in fitting a regression model. However,
these earlier methods are primarily designed for the linear regression model. It 
is not clear how to generalize them to handle other models such as the SVM 
considered in the present paper and still retain their computational efficiency. 

More recently, Yuan, Joseph and Zou [20] formalized the concept of structured
variable selection (SVS) to describe general hierarchical structures among
variables with traditional heredity principles as special cases when doing variable
selection. They argue that appropriately accounting for the general hierarchi- 
cal structure among variables not only enhances the model interpretability but 
also leads to improved estimation and prediction. The SVS framework gives a 
unified treatment of the linear regression model and generalized linear models. 
In addition, the SVS framework permits a very efficient implementation and 
enjoys nice theoretical properties. 

In this paper, we propose to adopt the SVS framework to simultaneously in- 
corporate the heredity principle and sparsity into the support vector machine in 
a way that retains the computational advantages of the SVM. The main idea is 
to introduce a scaling parameter to each effect and then enforce the hierarchical 
relationships among predictors and sparsity by a set of linear inequality con- 
straints on the corresponding scaling parameters. As a result, the optimization 
problem is a linear program and can be very efficiently solved using standard 
linear programming techniques. Our approach can handle both strong and weak 
heredity principles. Furthermore, we propose a nonparametric extension of the 
SVS framework based on which we develop nonparametric heredity SVMs. 

The rest of the paper is organized as follows. In the next section, we describe 
how to employ the SVS idea to incorporate the strong heredity principle into 
the SVM. The weak heredity principle can be implemented in a similar fashion 
with an additional convex relaxation step, in order to preserve the computational 
efficiency. In Section 3 we propose the nonparametric heredity SVMs. Section 4
contains some discussion. 

2. The Generalized Garrote and Heredity Principles 
2.1. Method 

Breiman's nonnegative garrote [2] is perhaps the first method in the literature
that uses an $\ell_1$ constraint to perform variable selection in linear regression
models. As an extension of the original nonnegative garrote, the generalized garrote
was first introduced in [20] to build the SVS framework. Here we show that the
generalized garrote idea can be used in the support vector machine as well. To
provide the readers a complete picture, we first introduce the basic idea of the
garrote in the context of the SVM. Suppose we have computed the $\ell_1$ SVM
coefficients $\hat\beta$; we then introduce a scaling parameter $\theta_j$ for each predictor $x_j$
and solve the following optimization problem

$$\min_{\theta, \beta_0} \sum_{i=1}^n \Big[1 - y_i\Big(\sum_{j=1}^p x_{ij}\hat\beta_j\theta_j + \beta_0\Big)\Big]_+ \tag{2.1}$$

subject to

$$\sum_{j=1}^p \theta_j \le M, \qquad \theta_j \ge 0 \quad \forall j,$$


where $M$ is the garrote shrinkage parameter. The new classifier is
$\mathrm{Sign}(\hat\beta_0 + \sum_{j=1}^p x_j\hat\beta_j\hat\theta_j)$. To compare it with the $\ell_1$ SVM, we consider another equivalent
formulation of (2.1)

$$\min_{\theta, \beta_0} \sum_{i=1}^n \Big[1 - y_i\Big(\sum_{j=1}^p x_{ij}\hat\beta_j\theta_j + \beta_0\Big)\Big]_+ + \lambda\sum_{j=1}^p \theta_j \tag{2.2}$$

subject to

$$\theta_j \ge 0 \quad \forall j.$$

When $M$ or $\lambda$ is properly chosen, some $\theta_j$'s will be shrunk to zero, and thus the
corresponding predictors ($x_j$'s) will be deleted from the classifier. Therefore, the
garrote performs variable selection in a way similar to the lasso.

The garrote received little attention in the literature compared to the enormous
popularity of the lasso. Recently, Yuan and Lin [22] showed the garrote
enjoys excellent finite sample performance if we use some regularized estimators 
as the initial estimator. The biggest advantage of the garrote, however, is its 
flexibility. We can easily modify the garrote by adding other linear constraints on 
the scaling parameters to meet some special requirements, such as the heredity 
principle. 

We adopt some notation from [20] to formally describe general hierarchical
structures among variables. Suppose the dimension of the predictor set is $p$. The
hierarchical relationships among predictors can be represented by sets
$\{\mathcal{D}_j : j = 1, \ldots, p\}$, where $\mathcal{D}_j$ contains the parent effects of the $j$th predictor. Consider,
for example, the predictors in model (1.3). The $(q+1)$th predictor is $x_{q+1} = z_1 z_2$
and its parent effects are $x_1 = z_1$ and $x_2 = z_2$. Thus $\mathcal{D}_{q+1} = \{1, 2\}$.
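
The parent sets for the quadratic model (1.3) can be enumerated mechanically. The sketch below is only an illustration: the function name is hypothetical, indices are 0-based (so the text's $\mathcal{D}_{q+1} = \{1, 2\}$ appears as `parents[q] == [0, 1]`), and the ordering follows the derived vector $x$ described in the Introduction (main effects, then interactions, then quadratic effects).

```python
from itertools import combinations


def quadratic_predictors(q):
    """Return (names, parents) for main, interaction and quadratic effects.

    parents maps the 0-based index of each predictor to the list of
    0-based indices of its parent effects (empty for main effects).
    """
    names = [f"z{j + 1}" for j in range(q)]
    parents = {j: [] for j in range(q)}
    # Two-factor interactions z_r * z_j with r < j: both main effects are parents.
    for r, j in combinations(range(q), 2):
        parents[len(names)] = [r, j]
        names.append(f"z{r + 1}*z{j + 1}")
    # Quadratic effects z_j^2: the only parent is the main effect z_j.
    for j in range(q):
        parents[len(names)] = [j]
        names.append(f"z{j + 1}^2")
    return names, parents
```

For `q = 7`, as in the simulations of Section 2.2, this gives 7 + 21 + 7 = 35 predictors.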

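The strong heredity principle says that if the $j$th predictor can be considered
for inclusion, all elements of $\mathcal{D}_j$ must be included. Note that in the garrote
model, the $j$th predictor is included if and only if its scaling parameter is
nonzero. To further incorporate the strong heredity principle, we generalize the
garrote as follows

$$\min_{\theta, \beta_0} \sum_{i=1}^n \Big[1 - y_i\Big(\sum_{j=1}^p x_{ij}\hat\beta_j\theta_j + \beta_0\Big)\Big]_+ + \lambda\sum_{j=1}^p \theta_j \tag{2.3}$$

subject to

$$\theta_j \ge 0 \quad \forall j \qquad \text{and} \qquad \theta_j \le \theta_r \quad \forall r \in \mathcal{D}_j,\ \forall j. \tag{2.4}$$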



We have imposed a set of inequality constraints on the scaling parameters,
besides the $\ell_1$ constraint which ensures the sparsity of the estimates. Note that
if $\theta_j > 0$, these linear inequalities in (2.4) force the scaling parameters in $\mathcal{D}_j$
to be positive. Therefore, the resulting model always obeys the strong heredity
principle. Furthermore, all the constraints are linear in terms of the scaling
parameters, and the feasible region under these constraints is convex. Therefore,
solving (2.3) remains a linear program.
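
To make the linear-program claim concrete, the following Python sketch assembles problem (2.3)-(2.4) in the standard form "minimize $c^T v$ subject to $A_{ub} v \le b_{ub}$ and simple bounds". It assumes the usual device of replacing each hinge term by a slack variable $\xi_i \ge 0$ with $\xi_i \ge 1 - y_i(\cdot)$; the slack variables, the function names, the variable ordering and the 0-based indices are choices made for this illustration and are not taken from the original formulation.

```python
def build_shsvm_lp(X, y, beta_hat, parents, lam):
    """Assemble (c, A_ub, b_ub, bounds) for the strong-heredity problem (2.3)-(2.4).

    Variables are ordered as theta_0..theta_{p-1}, beta_0, xi_0..xi_{n-1},
    where xi_i >= 0 is a slack standing in for the i-th hinge term.
    parents maps a predictor index j to the list of its parent indices.
    """
    n, p = len(X), len(beta_hat)
    nvar = p + 1 + n
    # Objective: lam * sum(theta) + sum(xi); beta_0 is not penalized.
    c = [lam] * p + [0.0] + [1.0] * n
    A_ub, b_ub = [], []
    # Hinge rows: -y_i * (sum_j x_ij * beta_hat_j * theta_j + beta_0) - xi_i <= -1.
    for i in range(n):
        row = [0.0] * nvar
        for j in range(p):
            row[j] = -float(y[i]) * X[i][j] * beta_hat[j]
        row[p] = -float(y[i])
        row[p + 1 + i] = -1.0
        A_ub.append(row)
        b_ub.append(-1.0)
    # Strong heredity rows (2.4): theta_j - theta_r <= 0 for every parent r of j.
    for j, parent_list in parents.items():
        for r in parent_list:
            row = [0.0] * nvar
            row[j] = 1.0
            row[r] = -1.0
            A_ub.append(row)
            b_ub.append(0.0)
    # theta_j >= 0, beta_0 free, xi_i >= 0.
    bounds = [(0.0, None)] * p + [(None, None)] + [(0.0, None)] * n
    return c, A_ub, b_ub, bounds


def is_feasible(v, A_ub, b_ub, bounds, tol=1e-9):
    """Check A_ub v <= b_ub and the variable bounds for a candidate point v."""
    for row, b in zip(A_ub, b_ub):
        if sum(a * x for a, x in zip(row, v)) > b + tol:
            return False
    for x, (lo, hi) in zip(v, bounds):
        if (lo is not None and x < lo - tol) or (hi is not None and x > hi + tol):
            return False
    return True
```

The returned arrays can be handed to any LP solver; for instance, they match the argument layout of `scipy.optimize.linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)`. A point with $\theta_j > 0$ and a zero parent fails `is_feasible`, which is exactly how (2.4) rules out models that violate strong heredity.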

The same idea can be applied to impose the weak heredity principle. The 
weak heredity principle says that if the jth variable is included in the model, at 
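least one of the elements of $\mathcal{D}_j$ must be included. Observe that

$$\max_{r \in \mathcal{D}_j} \theta_r > 0 \iff \text{at least one } \theta_r > 0,\ r \in \mathcal{D}_j, \quad \forall j.$$

We could consider the following optimization problem

$$\min_{\theta, \beta_0} \sum_{i=1}^n \Big[1 - y_i\Big(\sum_{j=1}^p x_{ij}\hat\beta_j\theta_j + \beta_0\Big)\Big]_+ + \lambda\sum_{j=1}^p \theta_j \tag{2.5}$$

subject to

$$\theta_j \ge 0 \quad \forall j \qquad \text{and} \qquad \theta_j \le \max_{r \in \mathcal{D}_j} \theta_r \quad \forall j. \tag{2.6}$$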



It is easy to see that the solution always obeys the weak heredity principle.
However, the feasible region under such constraints is no longer convex. It is
well known that non-convexity may cause various computational problems, such
as local minimizers and instability of the solution. To overcome the
non-convexity issue, we suggest using the convex envelope of these constraints for
the weak heredity principle

$$\min_{\theta, \beta_0} \sum_{i=1}^n \Big[1 - y_i\Big(\sum_{j=1}^p x_{ij}\hat\beta_j\theta_j + \beta_0\Big)\Big]_+ + \lambda\sum_{j=1}^p \theta_j \tag{2.7}$$

subject to

$$\theta_j \ge 0 \quad \forall j \qquad \text{and} \qquad \theta_j \le \sum_{r \in \mathcal{D}_j} \theta_r \quad \forall j. \tag{2.8}$$

Note that under (2.8), $\theta_j > 0$ implies that $\sum_{r \in \mathcal{D}_j} \theta_r > 0$ and therefore at least
one of its parents needs to be included in the model, which implies that the
resulting model obeys the weak heredity principle. Since the constraints in (2.8)
are linear and the feasible region under these constraints is convex, solving
(2.7) remains a linear program.

For the purpose of presentation, we refer to the new SVMs defined in (2.3) and
(2.7) as SHSVM and WHSVM, respectively.
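
The non-convexity of the max-type constraint (2.6), and the way the sum-type constraint (2.8) repairs it, can be checked on a tiny example. The sketch below considers a single effect with two parents; the function names and the three-component layout `(theta_j, theta_r1, theta_r2)` are hypothetical and used only for this illustration.

```python
def weak_max_ok(theta_j, parent_thetas):
    """Constraint (2.6): theta_j <= max of the parents' scaling parameters."""
    return theta_j <= max(parent_thetas)


def weak_sum_ok(theta_j, parent_thetas):
    """Constraint (2.8): theta_j <= sum of the parents' scaling parameters."""
    return theta_j <= sum(parent_thetas)


# Two points (theta_j, theta_r1, theta_r2) that both satisfy (2.6) ...
a = (1.0, 1.0, 0.0)
b = (1.0, 0.0, 1.0)
# ... and their midpoint, which does not.
mid = tuple((u + v) / 2.0 for u, v in zip(a, b))
```

Here `a` and `b` both satisfy (2.6), but their midpoint `(1.0, 0.5, 0.5)` does not, so the feasible region of (2.6) is not convex. All three points satisfy (2.8), and any point satisfying (2.8) with $\theta_j > 0$ still has at least one positive parent.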



2.2. Numerical studies 

We use numerical examples to demonstrate the benefits of incorporating hered- 
ity principles into the SVM model. 

In each simulated example, we generated 100 datasets, each with training 
samples of sizes n = 50, 100, and 200, and an independent test sample of size 
10000. In a benchmark example, 100 random partitions of the original data 
were created, each with a training sample and a test sample. In each example, 
all classifiers were fitted on a training sample and their generalization errors 
were computed on a test sample. Here the generalization error of a classifier $f$
is $\Pr(y f(x) < 0)$ under 0-1 loss. The Bayes rule minimizes the generalization
error and its error is called the Bayes error (risk). Note that the Bayes rule is
$\arg\max_{c \in \{1, -1\}} \Pr(y = c \mid x)$, which is unknown in practice. In our simulation
study we can compute the Bayes error because we know the true model. We 
reported the Bayes error and the averaged smallest generalization error of each 
competitor, thus avoiding the extra level of complexity in the comparison caused 
by the tuning parameter selection. 

We first consider three simulation models. In the first example the true model 
obeys the strong heredity principle, while in the second example the true model 
obeys the weak heredity principle. The third example concerns the situation 
when the true model does not obey any heredity principle. 

Simulation example 1. In the first set of simulations, the generated explanatory
variables $z_1, \ldots, z_7$ are standard normal, where the correlation between $z_r$
and $z_j$ is $\rho^{|r-j|}$, $\rho = 0, 0.5$. The class labels are generated from a logistic
regression model

log ( = 1 jf--^n = 2 Z1 + 4, 3 + 3** + 1. 
\Pi(y = -l\zi, ...,zr)J 

The predictor set for fitting the SVMs is $\{z_j, z_r z_j, z_j^2\}$, $r, j = 1, \ldots, 7$. The
predictor $z_r z_j$ represents the interaction between predictors $z_r$ and $z_j$, thus its
parent effects are $z_r$ and $z_j$. The predictor $z_j^2$ represents the quadratic effect of
$z_j$. We include the quadratic effect only if the linear main effect is included. Let
$\theta_j$ and $\theta_{jj}$ be the scaling parameters for $z_j$ and $z_j^2$, respectively. Let $\theta_{rj}$ be the
scaling parameter for $z_r z_j$ ($r \ne j$). Then the linear constraints in (2.4) become

$$\theta_{rj} \le \theta_r \ \text{ and } \ \theta_{rj} \le \theta_j, \quad \forall r \ne j,\ r, j = 1, \ldots, 7,$$
$$\theta_{jj} \le \theta_j, \quad j = 1, \ldots, 7.$$

The simulation results are summarized in Table 1. From Table 1 we see that
the SHSVM significantly outperforms the $\ell_1$ and $\ell_2$ SVMs in terms of classification
accuracy regardless of sample sizes, although the differences get smaller as
sample sizes increase. We also computed the frequency that the fitted $\ell_1$ SVM
obeys the strong heredity principle, as reported in the last column of Table 1.
The low frequency indicates that the $\ell_1$ SVM is not appropriate when a strong
heredity model is in demand.
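
Whether a fitted model obeys a heredity principle, which is what the frequency column records, can be checked directly from the set of active effects and the parent sets. The following Python sketch shows one way to do so; the function names and the string labels for effects are hypothetical.

```python
def obeys_strong(active, parents):
    """True if every active effect has all of its parents active."""
    return all(
        all(r in active for r in parents.get(j, []))
        for j in active
    )


def obeys_weak(active, parents):
    """True if every active effect with parents has at least one parent active."""
    return all(
        not parents.get(j) or any(r in active for r in parents[j])
        for j in active
    )
```

For example, with `parents = {"z1*z2": ["z1", "z2"], "z1^2": ["z1"]}`, the active set `{"z1", "z1*z2"}` obeys the weak but not the strong heredity principle, because the parent `z2` is missing.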

Simulation example 2. In the second set of simulations, we use the same
setup as in example 1 except that the class labels are generated from a logistic
regression model

$$\log\left(\frac{\Pr(y = 1 \mid z_1, \ldots, z_7)}{\Pr(y = -1 \mid z_1, \ldots, z_7)}\right) = 3.5 z_1 + 3 z_1 z_2 + 2.5 z_1 z_3 + 2 z_1 z_4 + 1.5 z_1 z_5 + z_1 z_6 + 1.$$

This model obeys the weak heredity principle and violates the strong heredity
principle. In order to fit the WHSVM, we note that the linear constraints in
(2.8) become

$$\theta_{rj} \le \theta_r + \theta_j \quad \forall r \ne j,\ r, j = 1, \ldots, 7.$$
Table 1

Simulation example 1. The true model obeys the strong heredity principle. Compare the
classification accuracy of the SHSVM, $\ell_2$ SVM and $\ell_1$ SVM. The numbers in parentheses
are standard errors and the frequency is the number of times the fitted $\ell_1$ SVM obeys the
strong heredity principle in 100 replications.

$\rho = 0$      n       SHSVM           $\ell_2$ SVM    $\ell_1$ SVM    frequency
                50      0.186 (0.003)   0.279 (0.003)   0.206 (0.003)   11/100
                100     0.154 (0.001)   0.226 (0.002)   0.169 (0.002)   14/100
                200     0.145 (0.001)   0.196 (0.001)   0.149 (0.001)   17/100
                Bayes   0.133

$\rho = 0.5$    n       SHSVM           $\ell_2$ SVM    $\ell_1$ SVM    frequency
                50      0.173 (0.003)   0.248 (0.003)   0.190 (0.003)   12/100
                100     0.159 (0.002)   0.216 (0.002)   0.167 (0.002)   16/100
                200     0.143 (0.001)   0.188 (0.001)   0.147 (0.001)   20/100
                Bayes   0.130



Table 2

Simulation example 2. The true model obeys the weak heredity principle. Compare the
classification accuracy of the WHSVM, $\ell_2$ SVM and $\ell_1$ SVM. The numbers in parentheses
are standard errors and the frequency is the number of times the fitted $\ell_1$ SVM obeys the
weak heredity principle in 100 replications.

$\rho = 0$      n       WHSVM           $\ell_2$ SVM    $\ell_1$ SVM    frequency
                50      0.248 (0.003)   0.303 (0.003)   0.273 (0.003)   11/100
                100     0.198 (0.002)   0.253 (0.002)   0.216 (0.002)   19/100
                200     0.163 (0.001)   0.215 (0.002)   0.183 (0.001)   22/100
                Bayes   0.142

$\rho = 0.5$    n       WHSVM           $\ell_2$ SVM    $\ell_1$ SVM    frequency
                50      0.199 (0.001)   0.242 (0.002)   0.220 (0.002)   11/100
                100     0.164 (0.001)   0.211 (0.001)   0.181 (0.001)   14/100
                200     0.143 (0.001)   0.184 (0.001)   0.154 (0.001)   23/100
                Bayes   0.121



Table 2 summarizes the simulation results. The WHSVM performs significantly
better than the $\ell_1$ SVM and the $\ell_2$ SVM. The last column in Table 2 shows
the frequency that the fitted $\ell_1$ SVM obeys the weak heredity principle. Again,
these frequencies are pretty low.
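
For readers who wish to reproduce a setting of this kind, the sketch below generates data from the logistic model of simulation example 2 and estimates the Bayes error by Monte Carlo, using the fact that under 0-1 loss the Bayes rule errs with probability $\min(p, 1 - p)$ at a point where $\Pr(y = 1 \mid z) = p$. It assumes the explanatory variables are jointly normal, so that an AR(1) recursion yields the stated correlation $\rho^{|r-j|}$; the function names and the seeding scheme are hypothetical.

```python
import math
import random


def ar1_normals(q, rho, rng):
    """Draw z_1..z_q, each standard normal, with corr(z_r, z_j) = rho**|r-j|."""
    z = [rng.gauss(0.0, 1.0)]
    for _ in range(q - 1):
        z.append(rho * z[-1] + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0))
    return z


def log_odds_example2(z):
    """Log-odds of y = 1 in simulation example 2 (z holds z_1..z_7)."""
    z1 = z[0]
    return (3.5 * z1 + 3.0 * z1 * z[1] + 2.5 * z1 * z[2] + 2.0 * z1 * z[3]
            + 1.5 * z1 * z[4] + z1 * z[5] + 1.0)


def prob_positive(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def simulate(n, rho, seed=0):
    """Return n pairs (z, y) with y in {1, -1}."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = ar1_normals(7, rho, rng)
        y = 1 if rng.random() < prob_positive(log_odds_example2(z)) else -1
        data.append((z, y))
    return data


def bayes_error_mc(n_draws, rho, seed=0):
    """Monte Carlo estimate of E[min(p, 1 - p)], the error of the Bayes rule."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        p = prob_positive(log_odds_example2(ar1_normals(7, rho, rng)))
        total += min(p, 1.0 - p)
    return total / n_draws
```

A call such as `bayes_error_mc(100000, 0.0)` gives a Monte Carlo estimate that can be compared with the Bayes error reported in Table 2; the estimate depends on the seed and the number of draws.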

Simulation example 3. Examples 1 and 2 have demonstrated the benefits
of recognizing the effect heredity. It would be interesting to investigate the 
performance of the SHSVM and the WHSVM when the true model actually 
violates the heredity principle. To this end, we considered the third example. 
We generated 5 explanatory variables and simulated the class labels from 

/ Pr(w = llzi, . . . ,z 5 ) \ 
\Pr(y = -l\zi, ...,z s )J 

We fitted the SHSVM, WHSVM, $\ell_2$ SVM and $\ell_1$ SVM using the predictor set
$\{z_j, z_r z_j, z_j^2\}$, $r, j = 1, \ldots, 5$. Since the true model is sparse, we expect that the
$\ell_2$ SVM has the worst performance. This is confirmed by the simulation results
Table 3

Simulation example 3. Compare the SHSVM, WHSVM, $\ell_2$ SVM and $\ell_1$ SVM when the true
model obeys no heredity principle. The numbers in parentheses are standard errors.

$\rho = 0$      n       SHSVM           WHSVM           $\ell_2$ SVM    $\ell_1$ SVM
                50      0.171 (0.002)   0.164 (0.002)   0.203 (0.003)   0.172 (0.003)
                100     0.147 (0.002)   0.140 (0.002)   0.173 (0.002)   0.143 (0.002)
                200     0.131 (0.001)   0.125 (0.001)   0.151 (0.001)   0.127 (0.001)
                Bayes   0.113

$\rho = 0.5$    n       SHSVM           WHSVM           $\ell_2$ SVM    $\ell_1$ SVM
                50      0.138 (0.002)   0.137 (0.002)   0.156 (0.002)   0.139 (0.002)
                100     0.119 (0.001)   0.115 (0.001)   0.134 (0.002)   0.115 (0.001)
                200     0.109 (0.001)   0.104 (0.001)   0.128 (0.001)   0.105 (0.001)
                Bayes   0.093





in Table 3. We see that there is basically no difference between the WHSVM
and the $\ell_1$ SVM. This observation suggests that it does not hurt to enforce the
heredity principle along with the sparsity, even when the true model does not 
obey the heredity principle. 

Birth weight data. We test the proposed structured SVMs on the birth
weight data that concern the birth weight of 189 infants at a US hospital [?].
The problem of interest is to predict if the birth weight is lower than 2.5 kg.
There are 8 explanatory variables, including mother's age (age), weight
(lwt), race (race), smoking status (smoke), number of previous premature labors
(ptl), history of hypertension (ht), uterine irritability (ui), and number of
physician visits in the first trimester (ftv). The variables age and lwt are
continuous while dummy variables were used to represent the discrete-valued variables.
Then the predictor set was generated as in the simulation models except that the
quadratic effects of dummy variables were not included. For dummy variables,
the heredity principles were applied at the group level. Because the sample size
is only 189, we used 5-fold cross-validation to estimate the classification error
of each method.

As can be seen from Table 4, the structured SVMs significantly outperform
both the $\ell_2$ SVM and the $\ell_1$ SVM. The best $\ell_1$ SVM model identifies 10 variables:
$\mathrm{age}^2$, $\mathrm{age} \cdot \mathrm{lwt}$, $\mathrm{age} \cdot \mathrm{ftv}$, $\mathrm{lwt}^2$, $\mathrm{lwt} \cdot \mathrm{race}$, $\mathrm{lwt} \cdot \mathrm{smoke}$, $\mathrm{lwt} \cdot \mathrm{ptl}$,
$\mathrm{lwt} \cdot \mathrm{ht}$, $\mathrm{lwt} \cdot \mathrm{ui}$, and $\mathrm{lwt} \cdot \mathrm{ftv}$. This model does not satisfy the heredity
principles, because, for instance, $\mathrm{age}^2$ and $\mathrm{age} \cdot \mathrm{lwt}$ are included without their parent
factor age. The frequencies of the $\ell_1$ SVM model satisfying the strong and weak
heredity principles were 10/20 and 14/20, respectively. The model selected by
the WHSVM includes age, lwt, and ftv together with the 10 variables in the $\ell_1$
SVM model. The SHSVM model adds the variables race, smoke, ptl,
ht, ui, and ftv to the WHSVM model.

One might wonder which heredity SVM should be used in this real data 
example. If the modeler does not have a strong preference in using either strong 
or weak heredity principle, the data suggest that the WHSVM is perhaps better 
than the SHSVM, since they have very similar classification performance and 
the WHSVM uses fewer variables.
Table 4

Birth weight data: average five-fold cross-validation errors with standard errors (reported in
parentheses) based on 30 replications.

SHSVM           WHSVM           $\ell_2$ SVM    $\ell_1$ SVM
0.291 (0.002)   0.294 (0.002)   0.307 (0.002)   0.305 (0.001)



3. Nonparametric Heredity SVMs 



In the previous section we have discussed the heredity principle when each 
effect is represented by a single predictor. In many real world applications, we 
often need to nonparametrically model the main and interaction effects. Let us 
consider the following model where the class label y and explanatory variables 
$z_1, z_2, \ldots, z_q$ are related through

$$\log\left(\frac{\Pr(y = 1 \mid z_1, \ldots, z_q)}{\Pr(y = -1 \mid z_1, \ldots, z_q)}\right) = \sum_{j=1}^q f_j(z_j) + \sum_{1 \le r < j \le q} f_{rj}(z_r, z_j). \tag{3.1}$$

We have omitted the constant term for simplicity. The main effect of variable
$z_j$ is $f_j(z_j)$ and the interaction effect between variables $z_r$ and $z_j$ is $f_{rj}(z_r, z_j)$.
Obviously the above model is a generalization of the popular generalized additive
model. The model (3.1) can be more appropriate than the generalized
additive model if interaction effects cannot be ignored.

Under the strong heredity, for the interaction effect $f_{rj}(z_r, z_j)$ to be active
both its parent effects, $f_r(z_r)$ and $f_j(z_j)$, should be active, whereas under the
weak heredity only one of its parent effects needs to be active. In this section 
we develop a method that can automatically identify significant effects while 
respecting the heredity principle. 



3.1. Imposing heredity principles 

If we assume $f_j(z_j) = \beta_j z_j$ and $f_{rj}(z_r, z_j) = \beta_{rj} z_r z_j$, then the model reduces
to the parametric case. We show here that the parametric assumption is not
necessary in order to implement the heredity principle by using the SVS framework.
Suppose that we have found a good initial estimate of the full model (3.1)
and we denote the initial estimates by $\hat f_j(z_j)$ and $\hat f_{rj}(z_r, z_j)$. We assign scaling
parameters $\theta_j$ to $\hat f_j(z_j)$ and $\theta_{rj}$ to $\hat f_{rj}(z_r, z_j)$. The SHSVM can be formulated
as follows

$$\min_{\theta} \sum_{i=1}^n \Big[1 - y_i\Big(\sum_{j=1}^q \hat f_j(z_{ij})\theta_j + \sum_{1 \le r < j \le q} \hat f_{rj}(z_{ir}, z_{ij})\theta_{rj}\Big)\Big]_+ \tag{3.2}$$

subject to

$$\sum_{j=1}^q \theta_j + \sum_{1 \le r < j \le q} \theta_{rj} \le M, \qquad \theta_j \ge 0,\ \theta_{rj} \ge 0 \quad \forall r, j,$$
$$\theta_{rj} \le \theta_r \ \text{ and } \ \theta_{rj} \le \theta_j \quad \forall r, j. \tag{3.3}$$
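Likewise, we define the WHSVM as

$$\min_{\theta} \sum_{i=1}^n \Big[1 - y_i\Big(\sum_{j=1}^q \hat f_j(z_{ij})\theta_j + \sum_{1 \le r < j \le q} \hat f_{rj}(z_{ir}, z_{ij})\theta_{rj}\Big)\Big]_+ \tag{3.4}$$

subject to

$$\sum_{j=1}^q \theta_j + \sum_{1 \le r < j \le q} \theta_{rj} \le M, \qquad \theta_j \ge 0,\ \theta_{rj} \ge 0 \quad \forall r, j,$$
$$\theta_{rj} \le \theta_r + \theta_j \quad \forall r, j. \tag{3.5}$$

The final classifier is $\mathrm{Sign}\big(\sum_{j=1}^q \hat f_j(z_j)\hat\theta_j + \sum_{1 \le r < j \le q} \hat f_{rj}(z_r, z_j)\hat\theta_{rj}\big)$. The linear
inequalities in (3.3) guarantee that the SHSVM obeys the strong heredity principle.
Similarly, the linear inequalities in (3.5) guarantee that the WHSVM obeys the
weak heredity principle. Moreover, solving for the scaling parameters is a linear
program.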



3.2. Computing the initial estimator 

There are many nonparametric estimation methods that can give us a good
initial estimator of the model (3.1). The choice of the estimation method is
not essential for using the SHSVM and the WHSVM. In this work, for
computational considerations, we obtain the initial estimates by using penalized
B-splines [?]. Penalized B-splines have been widely used in statistics for
nonparametric function estimation (cf. [?], [11] and [17]). For each variable $z_j$, we
take a basis of B-spline functions $b_{j,k}(z_j)$, $k = 1, 2, \ldots, N_j$, for representing
the function $f_j(z_j)$. Then the $N_r \times N_j$ dimensional tensor product basis defined
by

$$b_{rj,k_1k_2}(z_r, z_j) = b_{r,k_1}(z_r)\, b_{j,k_2}(z_j), \quad k_1 = 1, 2, \ldots, N_r \ \text{ and } \ k_2 = 1, 2, \ldots, N_j,$$

can be used for representing the interaction effect $f_{rj}(z_r, z_j)$. With B-spline basis
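functions at hand, we can compute the $\ell_2$ SVM estimate of the model (3.1) by
minimizing

$$\sum_{i=1}^n \Big[1 - y_i\Big(\alpha_0 + \sum_{j=1}^q \sum_{k=1}^{N_j} \alpha_{jk} b_{j,k}(z_{ij}) + \sum_{1 \le r < j \le q} \sum_{k_1=1}^{N_r} \sum_{k_2=1}^{N_j} \alpha_{rj,k_1k_2}\, b_{r,k_1}(z_{ir})\, b_{j,k_2}(z_{ij})\Big)\Big]_+ + \lambda\|\alpha\|_2^2,$$

where $\|\alpha\|_2^2 = \sum_{j=1}^q \sum_{k=1}^{N_j} \alpha_{jk}^2 + \sum_{1 \le r < j \le q} \sum_{k_1=1}^{N_r} \sum_{k_2=1}^{N_j} \alpha_{rj,k_1k_2}^2$. Then the initial
estimates are

$$\hat f_j(z_j) = \sum_{k=1}^{N_j} \hat\alpha_{jk}\, b_{j,k}(z_j),$$

$$\hat f_{rj}(z_r, z_j) = \sum_{k_1=1}^{N_r} \sum_{k_2=1}^{N_j} \hat\alpha_{rj,k_1k_2}\, b_{r,k_1}(z_r)\, b_{j,k_2}(z_j).$$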
In computing the initial estimates, although there are $q$ variables, the actual
dimension of the predictor set is $\sum_{j=1}^q N_j + \sum_{1 \le r < j \le q} N_r N_j$, which could be a large
number. The quadratic penalty not only regularizes the nonparametric fit but
also allows for an efficient implementation through singular value decomposition
[10]. Let $B$ denote the matrix of basis functions in the predictor set, with $i$th row $B_i$. We need to solve

$$(\hat\alpha_0, \hat\alpha) = \arg\min_{\alpha_0, \alpha} \sum_{i=1}^n \left[1 - y_i(\alpha_0 + B_i\alpha)\right]_+ + \lambda\alpha^T\alpha.$$

Suppose the singular value decomposition of $B$ is $B = UDV^T = RV^T$ where $R$
is an $n \times n$ matrix, then we solve

$$(\hat\gamma_0, \hat\gamma) = \arg\min_{\gamma_0, \gamma} \sum_{i=1}^n \left[1 - y_i(\gamma_0 + R_i\gamma)\right]_+ + \lambda\gamma^T\gamma,$$

and $\hat\alpha = V\hat\gamma$ and $\hat\alpha_0 = \hat\gamma_0$. See Theorem 1 in [10]. Therefore, the computations
can be done in an $n$ dimensional space instead of the original high-dimensional
predictor space.
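
The reduction can be motivated by a short derivation, given here as a sketch under the assumption that the number of basis functions exceeds $n$ and that $V$ has $n$ orthonormal columns ($V^TV = I_n$). The decomposition of $\alpha$ into $V\gamma$ plus an orthogonal remainder $\alpha_\perp$ is notation introduced only for this illustration.

```latex
\begin{aligned}
\alpha &= V\gamma + \alpha_\perp, \qquad \gamma = V^T\alpha, \quad V^T\alpha_\perp = 0, \\
B\alpha &= RV^T(V\gamma + \alpha_\perp) = R\gamma, \\
\alpha^T\alpha &= \gamma^T V^T V \gamma + \alpha_\perp^T\alpha_\perp
              = \gamma^T\gamma + \|\alpha_\perp\|_2^2 \;\ge\; \gamma^T\gamma .
\end{aligned}
```

The hinge term depends on $\alpha$ only through $B\alpha = R\gamma$, while the penalty is smallest when $\alpha_\perp = 0$. A minimizer therefore has the form $\hat\alpha = V\hat\gamma$, where $\hat\gamma$ solves the $n$-dimensional problem.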

3.3. Numerical examples 

We now present some numerical examples to demonstrate the performance of the 
nonparametric heredity SVMs. We compared the nonparametric heredity SVMs 
with the $\ell_2$ SVM and the Gaussian kernel SVM. In all examples, the $\ell_2$ SVM
was fitted using the same B-spline basis functions used for fitting the nonparametric
heredity SVMs. 

Simulation example 4. We first generated explanatory variables $z_1, \ldots, z_5$
from a multivariate normal distribution in which the correlation between $z_r$
and $z_j$ is $0.5^{|r-j|}$. We considered a sparse model where the class labels were
generated from a logistic regression model

$$\log\left(\frac{\Pr(y = 1 \mid z_1, \ldots, z_5)}{\Pr(y = -1 \mid z_1, \ldots, z_5)}\right) = f_1(z_1) + f_2(z_2) + f_{12}(z_1, z_2) + 1.$$

The true model obeys the strong heredity principle. We used five B-spline basis
functions $\{b_{j,1}(z_j), \ldots, b_{j,5}(z_j)\}$ to represent each $f_j(z_j)$, and the interaction
effect $f_{rj}(z_r, z_j)$ was represented by the tensor product basis functions
$\{b_{r,1}(z_r)b_{j,1}(z_j), b_{r,1}(z_r)b_{j,2}(z_j), \ldots, b_{r,5}(z_r)b_{j,5}(z_j)\}$. The representing
coefficients ($\alpha$) were chosen as follows: (i) coefficients of the 5 basis functions for
$f_1(z_1)$ are (2.1, -2.9, 0.3, 2.7, -0.1), (ii) coefficients of the 5 basis functions
for $f_2(z_2)$ are (-2.8, -1.2, 1.8, 1.7, -0.8), and (iii) coefficients of the 25 basis
functions for $f_{12}(z_1, z_2)$ are (-2.4, -0.1, 0.6, 3, 2.8, -0.9, 0.3, 1, -0.9, -1.3,
0.9, 2.3, 1.9, 0.8, -0.2, 1.2, 2.1, 1.0, -0.8, -1.7, -0.8, -1.2, 2.1, -2.8, 0.1).
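
The size of the basis expansion just described can be tallied with a few lines of Python. The sketch below counts main-effect and interaction basis functions for given numbers of basis functions per variable and forms the tensor-product basis values at a single observation; the function names are hypothetical, and interactions are counted once per unordered pair of variables.

```python
from itertools import combinations


def basis_dimension(n_basis):
    """Return (main, interaction) counts; n_basis[j] is N_j for variable j."""
    main = sum(n_basis)
    inter = sum(
        n_basis[r] * n_basis[j]
        for r, j in combinations(range(len(n_basis)), 2)
    )
    return main, inter


def tensor_row(b_r, b_j):
    """Tensor-product basis values b_{r,k1} * b_{j,k2} at one observation.

    b_r and b_j hold the marginal basis values of z_r and z_j; the result
    is ordered with k2 running fastest.
    """
    return [u * v for u in b_r for v in b_j]
```

With five variables and five basis functions each, `basis_dimension([5] * 5)` returns `(25, 250)`, that is, 275 basis functions in total.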

It should be mentioned that in this model the dimension of the predictor set 
is 275. On the other hand, this model is very sparse in terms of the number 
of active effects (only three active effects). We simulated a training sample of 
Table 5

Compare the SHSVM, the $\ell_2$ SVM and the Gaussian kernel SVM when the true model obeys
the strong heredity principle. The numbers in parentheses are standard errors.

SHSVM           $\ell_2$ SVM    GK-SVM          Bayes
0.209 (0.001)   0.214 (0.001)   0.218 (0.001)   0.202



Table 6

Compare the WHSVM, the $\ell_2$ SVM and the Gaussian kernel SVM when the true model
obeys the weak heredity principle. The numbers in parentheses are standard errors.

WHSVM           $\ell_2$ SVM    GK-SVM          Bayes
0.217 (0.001)   0.226 (0.001)   0.222 (0.001)   0.197



size 100 from the above model and collected an independent test sample of size 
10000 to compute the generalization error of each competitor. The simulation 
was repeated 100 times. The simulation results are summarized in Table 5,
from which two interesting observations can be made. First we see that the $\ell_2$
SVM actually does better than the Gaussian kernel SVM in this example. This 
observation suggests that although the Gaussian kernel SVM is perhaps the 
most popular nonparametric SVM classifier, it is not always the best choice in 
all problems. Second and more importantly, the SHSVM is clearly the winner 
among all three competitors. 

Simulation example 5. In this example we considered the same setup as in
example 4 except that the class labels were generated from a logistic regression
model

$$\log\left(\frac{\Pr(y = 1 \mid z_1, \ldots, z_5)}{\Pr(y = -1 \mid z_1, \ldots, z_5)}\right) = f_1(z_1) + f_2(z_2) + f_{15}(z_1, z_5) + f_{23}(z_2, z_3) - 1.$$

Hence this model obeys the weak heredity principle. As in example 4, we used
B-splines to model each effect. The representing coefficients are chosen as follows:

(i) coefficients of the 5 basis functions for $f_1(z_1)$ are (3.0, -2.5, 2.0, -1.5, 1.0),

(ii) coefficients of the 5 basis functions for $f_2(z_2)$ are (1.5, 2.0, -3.0, -2.5, -2.0),

(iii) coefficients of the 25 basis functions for $f_{15}(z_1, z_5)$ are (7.1, -9.8, 1.1, 9.0,
-0.3, -8.1, -0.4, 2.0, 10, 9.4, -3.1, 1.0, 3.2, -3.1, -4.3, 3.1, 7.7, 6.2, 2.7,
-0.7, 3.9, 6.8, 3.4, -2.5, -5.6), and

(iv) coefficients of the 25 basis functions for $f_{23}(z_2, z_3)$ are (-2.6, -3.8, 7.0, -9.4,
0.5, -9.2, -4.0, 6.1, 5.6, -2.7, 5.5, 9.3, -5.4, 9.1, -2.8, 5.1, 3.9, 6.6, -0.6,
6.8, 0.8, 8, -3.6, -2.5, -6).

As can be seen from Table 6, in this example the Gaussian kernel SVM
outperforms the $\ell_2$ SVM, but the best performance is given by the WHSVM.

South African Heart Disease Data. Here we demonstrate the utility 
of the nonparametric heredity SVMs through an analysis of the South African 
heart disease data [11], which consist of 462 samples of 9 risk factors (8 continuous
and 1 binary). The response indicates the presence of heart disease. Previous
studies of this data suggest that nonparametric functions should be used to 
model the effects of these 9 risk factors. We first used the popular Gaussian 
kernel SVM to analyze the data whose classification error can be used as a good 
Table 7

South African heart disease data: average 5-fold cross-validation errors based on 30
replications. The numbers in parentheses are standard errors.

SHSVM           WHSVM           GK-SVM
0.267 (0.001)   0.270 (0.001)   0.276 (0.001)



benchmark for comparison. To fit the SHSVM and the WHSVM, we used B-splines
to flexibly model the main effects of 8 continuous risk factors and used the
tensor product basis functions of B-splines to model the interaction effects. In 
total, there are 33 basis functions and 480 basis functions used for representing 
the main effects and the interaction effects, respectively. Since we did not have 
an independent test set, we found the smallest 5-fold cross-validation error of 
each competitor. Then we repeated the whole procedure 30 times and reported 
the average 5-fold cross-validation errors. As can be seen from Table 7, the
SHSVM does significantly better than the Gaussian kernel SVM. 

4. Discussion 

In this paper we have developed a unified framework for simultaneously incor- 
porating the heredity principle and sparsity into the support vector machine. 
By adopting the scaling parameter idea from the nonnegative garrote, we have 
shown that both strong and weak heredity principles can be enforced by a set of 
linear inequality constraints on the scaling parameters. Our approach is computationally
efficient, as the optimization problem is a linear program. Moreover, we
have also extended the framework to handle nonparametric models, which shows 
the flexibility of our method. The encouraging numerical results suggest that 
the newly proposed method is a useful addition to the classification toolbox. 

To fix the main idea, we have used the penalized $\ell_1$ SVM to construct the
initial classifier. Based on our experience, this choice of initial classifier worked 
quite well even when the dimension of predictors exceeds the sample size. It is 
possible to further improve the heredity SVMs by using better initial classifiers 
in certain problems. 

Finally, we comment on the path-based computation of the structured SVMs. 
Yuan and Lin [22] showed that the solution path of the original nonnegative
garrote is piecewise linear and constructed an efficient algorithm for building
its whole solution path. One may expect the same is true for the garrote SVM.
With the heredity constraints, the solution paths of the $\theta$'s will remain piecewise
linear as a function of their $\ell_1$ norm. However, the path-following algorithm will
become considerably more complicated. It is not clear if computing the whole 
solution path will provide us considerable computational savings, compared with 
running linear programming for a grid of tuning parameters. 